Re-analysis of an outbreak of Shiga toxin-producing Escherichia coli O157:H7 associated with raw drinking milk using Nanopore sequencing

The aim of this study was to compare Illumina and Oxford Nanopore Technology (ONT) sequencing data to quantify genetic variation to assess within-outbreak strain relatedness and characterise microevolutionary events in the accessory genomes of a cluster of 23 genetically and epidemiologically linked isolates related to an outbreak of Shiga toxin-producing Escherichia coli O157:H7 caused by the consumption of raw drinking milk. There were seven discrepant variants called between the two technologies, five were false-negative or false-positive variants in the Illumina data and two were false-negative calls in ONT data. After masking horizontally acquired sequences such as prophages, analysis of both short and long-read sequences revealed the 20 isolates linked to the outbreak in 2017 had a maximum SNP distance of one SNP between each other, and a maximum of five SNPs when including three additional strains identified in 2019. Analysis of the ONT data revealed a 47 kbp deletion event in a terminal compound prophage within one sample relative to the remaining samples, and a 0.65 Mbp large chromosomal rearrangement (inversion), within one sample relative to the remaining samples. Furthermore, we detected two bacteriophages encoding the highly pathogenic Shiga toxin (Stx) subtype, Stx2a. One was typical of Stx2a-phage in this sub-lineage (Ic), the other was atypical and inserted into a site usually occupied by Stx2c-encoding phage. Finally, we observed an increase in the size of the pO157 IncFIB plasmid (1.6 kbp) in isolates from 2019 compared to those from 2017, due to the duplication of insertion elements within the plasmids from the more recently isolated strains. The ability to characterize the accessory genome in this way is the first step to understanding the significance of these microevolutionary events and their impact on the genome plasticity and virulence between strains of this zoonotic, foodborne pathogen.

Variant calling of the Illumina sequencing data identified 23 isolates that fell within the same 5-SNP single linkage cluster.The 20 isolates linked to the outbreak in 2017 had a maximum SNP distance of 1 SNP between each other, and a maximum of 5 SNPs when including cases identified in 2019 (Fig. 1).Analysis of the Nanopore data identified a maximum of 1 SNP variants between the 20 outbreak isolates and 5 SNPs between all 23 isolates (Fig. 1).
The variants identified in the Illumina sequencing data were compared to those identified in the Nanopore data.There were seven discrepant variants called between the two technologies (Table 1).Four of the seven mismatches were determined to be false positive variants in the Illumina dataset due to ambiguously aligning short reads (Table 1).A fifth, this time false negative, variant was called in the Illumina short read data in a single sample due to low coverage over a region of the genome where the variant was called (Table 1).The remaining two discrepant variants were deemed to be false negatives in the Nanopore data due to errors associated with calling SNPs within homopolymer sequences (Table 1).
When accounting for the above false positive/false negative discrepant variants there remained only a single discrepant variant which was classified as a true variant [Nanopore (G var/T reference), Illumina (N/T reference)], accounting for the single SNP difference in sample 432,299 (Fig. 1).
This comparison highlighted the systemic differences associated with each technology, specifically the basecalling errors related to homopolymer detection observed in Nanopore data 23,24 and the ambiguous alignment of Illumina data to homologous and paralogous regions 25 .This comparison also demonstrates the importance of masking these regions within the reference genome to produce accurate and meaningful results 23 .

Genomic features of long-read assemblies of the chromosome
The assemblies of the sequences of the 23 isolates in this study all contained either two or three closed contigs, each supporting a single replicon (Table 2).The chromosome size of the isolates sequenced in this study (n = 23) ranged from 5,507,151 to 5,555,878 bp in length, a maximum difference of 48,727 bp and an average size of 5,553,177 bp (Table 2).
Aligning and comparing the chromosomes of all 23 isolates within the same 5-SNP single linkage cluster led to the discovery of a large-scale recombination (LCR) event present in one of the cattle isolates (432,301) (Fig. 2).The large-scale recombination event was characterised by a 650kbp inversion between prophages 5 (potC) and 7 (yebW) (Fig. 2).At either edge of the inversion, prophages 5 (potC) and 7 (yebW) both have a 10.05kbp homologous sequence containing prophage structure encoding genes such as tail proteins, host specificity proteins and several hypothetical genes.The 10.05kbp homologous sequences in both prophages share a 97.7% sequence similarity.
It is known that the STEC O157:H7 genome undergoes large scale recombination to produce large inverted sequences within the chromosome 26,27 .The outbreak in this study was selected because it included eight isolates from cattle, and we wanted to look for LCRs in STEC O157:H7 in the animal reservoir, as well as in the RDM and human cases.We observed a LCR in just one cattle isolate; it is uncertain whether this genetic event occurred in vivo or on sub-culture in the laboratory.Potential phenotypic effects of LCRs, for example strain fitness, infectivity, or impact on patient outcomes, are yet to be determined.Within this 5-SNP cluster of 23 isolates, minimal  www.nature.com/scientificreports/large-scale chromosomal variation was observed, regardless of the source (animal, food or human), the clinical outcome of the case or the year the cases were detected.

Analysis of prophage and prophage-like content
All 23 isolates contained the same number of prophages (n = 17) of which two were stx2a-encoding prophage (Figs. 2, 3).All prophages in the samples sequenced ranged from 8.2 to 144.5 kbp in size (Table S1).Across all 23    ,4).There were 2/17 prophages that showed variation, the first type of variation was a single deletion of a 47,389bp region in a large compound prophage 5 (potC) in a cattle isolate (432,299), different to the isolate exhibiting the LCR (Figs. 2, 4, 5).The deletion event was surrounded by two 5.8 kbp homologous sequences containing structural tail proteins and hypothetical genes.The second type of variation observed in prophages 5 (potC) and 7 (yebW) were related to the 0.65 Mbp LCR observed where these prophages acted as the break points.
Here, and in previous studies, pairwise comparisons revealed several common non-stx-encoding prophages found in strains of in STEC O157:H7 that are temporally and geographically distinct 21 .This indicates that certain prophages are stable within the STEC O157:H7 genome and perhaps can no longer be induced.However, in contrast to the above, some prophages which have previously been described as non-inducible have been shown to be mobile 14,16 .In the limited dataset included in our analysis to date, strains from the same sub-lineage and those more closely in time and space, had more prophages in common than those strains that were phylogenetically, geographically and temporally distinct 21,28 .Loss and acquisition of prophage content may be influenced by both time and geographical setting.

Analysis of Shiga toxin-encoding prophages
All 23 isolates in the outbreak cluster harboured two stx2a-encoding prophages (Figs. 2, 3) previously undetected via Illumina sequencing; one of which was integrated at the stx-encoding bacteriophage integration site (SBI) argW, while the other was integrated at sbcB.The stx2a-encoding bacteriophage in sub-lineage Ic is commonly found at argW, and phylogenetic analysis showed that this bacteriophage clustered with stx2a-encoding phage within sub-lineage Ic (Fig. 3) 21 .The sbcB SBI is more commonly occupied by stx2c-encoding bacteriophages, and phylogenetic analysis showed that the stx2a-encoding bacteriophage integrated at sbcB in the outbreak strain was located on a branch mainly comprising stx2c-encoding bacteriophage 21 .
Previous studies have described the loss of stx2c-encoding phage and subsequent acquisition of stx2a-encoding phage exhibiting similar sequences to stx2c-encoding bacteriophage at the same SBI, in sub-lineage IIb 10,21 , however, this is the first report of this phenomenon occurring in sub-lineage Ic.Strains of STEC O157:H7 harbouring more than one stx2a prophage have been described previously 29 , but again this is the first report of the acquisition of two different stx2a-encoding phage in this sub-lineage.We previously showed that strains harbouring stx2a only belonging to sub-lineage Ic are significantly more likely to be associated with severe clinical outcomes than those strains harbouring stx2a only in sub-linage IIb 6 .The representative strains of STEC O157:H7 sub-lineage IIb in previous studies 10 had only one type of stx2a-encoding phage; the presence of two different stx2a-encoding phage may play a role in enhancing pathogenicity in sub-lineage Ic.

Plasmid analysis
All isolates contained an IncFIB plasmid, the pO157 that is characteristic of STEC O157:H7, ranging in size from 93,689 to 94,050 between the outbreak isolates from 2017, a maximum difference of 361bp and an average size of 93,997bp.In the three temporally distinct isolates from 2019, the size range increased to 95,369bp with a maximum difference of 1,680bp.The approximately 1.6kbp increase between IncFIB plasmids in isolates from 2019 compared to those from 2017, is due to the duplication of insertion elements within those plasmids (Fig. 6).One isolate (435,354) from 2017, also contained a smaller IncI1-γ plasmid, 85,962by in size (Table 2a).

Conclusions
In this study, we evaluated our bioinformatics approach to analysing long read sequencing data in an outbreak setting and showed the results of these analyses correlated well with the bioinformatics pipelines routinely employed for analysing the short-read sequencing data.Minimal LCRs and/or prophage variation was observed within the isolates linked to this point source outbreak of STEC O157:H7 PT21/28 caused by the consumption  www.nature.com/scientificreports/Sakai were masked from further analyses.5-methylcytosine positions were identified using Nanopolish V0.11.3 17 and then masked from the Nanopore VCFs as described in Greig et al. 51 .The final number of positions masked was 1,189,993 bp, leaving a final reference of 4,308,457 bp.The maximum likelihood phylogenetic tree was constructed by RAxML v8.1.17 52using an alignment generated from SnapperDB 12 that recombination had been accounting for by Gubbins v2.00 53 .Visualisation of the phylogenetic tree was performed using FigTree v1.4.4 43 .To detect false positive/negative SNPs called by Illumina and Nanopore reads, discrepant variant positions between Illumina and Nanopore relative to the reference genome were extracted.The aforementioned variants and those that also had a lower-than-average mapping quality were then masked in the alignment.

Detection and characterisation of chromosomal structural variation
Chromosome synteny was compared by aligning outbreak sample chromosomes using Easyfig v2.2.5 44 .Once samples in one chromosome were aligned, structural differences could be determined and further characterised using Artimis v18.1.0 54.
To determine if there were multiple isoforms within each sample's reads (FASTQ).The FASTQ for an outbreak sample in one isoform was aligned to a finalised assembly with a different isoform using Minimap2 v2.17 48 and Samtools v0.7.17 49 .Using Tablet v1.17.08.17 55 , the alignments were visualised and the breakpoints at each isoform where identified.Once breakpoints were identified relative to each isoform, those positions were used with Samtools v0.7.17 49 view to isolate reads that must align across both ends of each prophage breakpoint.Any reads that did align across a given set of breakpoints must share the same size as it exists in the FASTQ file and not clipped, to be considered.

Data deposition
Illumina and Nanopore FASTQ files are available from National Center for Biotechnology Information (NCBI) BioProject PRJNA315192.The SRA (sequence read archive) accession numbers for both technologies are in supplementary Table 2.The outbreak sample finalised assemblies can also be found under BioProject PRJNA315192 and the GenBank accession numbers are located in both Table 2 and supplementary Table 2.

Figure 1 .
Figure 1.A maximum-likelihood phylogeny showing both Illumina derived and nanopore derived SNP-typing results for samples sequenced in this study.

Figure 2 .
Figure 2. Easyfig alignment showing the chromosome and loci of prophages in all samples sequenced in this study.Stx-encoding prophage, Red; Prophage-like region, Blue; Locus of Enterocyte Effacement (LEE), Green and other non-stx-encoding prophages, Black.

Figure 3 .
Figure 3. Neighbour joining tree based on Jaccard distances of stx-encoding prophages of publicly available samples and the outbreak samples sequenced in this study.Prophages are coloured by sub-lineage of STEC O157:H7.Sub-lineage Ia, Green; Ib, Yellow; Ic, Red; I/IIa, Blue; I/IIb, Grey; IIa, Orange; IIb, Black and IIc, Purple.

Figure 4 .
Figure 4. Mid-rooted neighbour-joining trees of Jaccard distances showing prophages from samples sequenced in this study with prophages from BA000007 (Sakai) (A), STEC O157:H7 strain 9000 (B) and STEC O157:H7 strain 397,404 (C).In each diagram prophages grouped by green are prophages shared in samples and reference genome; red are reference genome only and yellow are prophages unique samples sequenced in this study.

Table 1 .
The discrepant variant calls between illumina and nanopore datasets for all outbreak samples.

Table 2 .
Table detailing the ID of each strain, finalised assembly metrics, plasmid replicon typing, prophage counts and assembly-based accessions.a , b and c detail strains shared by the same case/patient.